Machine Learning (ML) and Topological Data Analysis (TDA) are different approaches to data analysis, each with its own strengths and weaknesses.
I used the following libraries for the analysis and visualization. To keep the post concise, I don't show the code for most of the data cleaning and analysis steps, but the code can be found on GitHub. The TDA package is used for its persistent homology capabilities, TDAmapper implements the Mapper algorithm, and ggmap is used to visualize spatial data and models on top of static maps from Google.
library(readxl)
library(TDA)
library(dplyr)
library(ggplot2)
library(ggmap)
library(TDAmapper)
library(igraph)
library(geosphere)
library(lubridate)
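This collision data set consists of 160 observations. The recorded collision times range from 06:00 to 18:00; they appear in the data as 1899-12-31 06:00:00 through 1899-12-31 18:00:00, where the 1899-12-31 date looks like a placeholder attached to time-only values rather than a real collision date. New York City encompasses five county-level administrative divisions called boroughs: Manhattan, Brooklyn, Queens, the Bronx, and Staten Island. The data do not identify the borough of each collision.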
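Time-only spreadsheet cells read with readxl typically come back as date-times on 1899-12-31, which would explain the range above. The following is a minimal sketch of how the time of day could be separated from that placeholder date; the vector `collision_time` and its three values are made up for illustration and are not taken from the data set.

```r
# Hypothetical time-only values as they might look after import
# (placeholder date 1899-12-31, UTC).
collision_time <- as.POSIXct(
  c("1899-12-31 06:00:00", "1899-12-31 12:30:00", "1899-12-31 18:00:00"),
  tz = "UTC"
)

# Keep only the time of day; the date component carries no information.
time_of_day <- format(collision_time, "%H:%M", tz = "UTC")

# Integer hour, handy for deriving time-of-day categories later on.
hour_of_day <- as.integer(format(collision_time, "%H", tz = "UTC"))
```

The lubridate function `hour()` gives the same integer hour, so either approach should work for building derived fields.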
Overall, the data file is clean with few missing observations, so the main data wrangling tasks include:
The following are derived fields
This data set is balanced, with equal numbers of accidents occurring in the afternoon and at night.
The total number of accidents increases with time.
The majority of accidents did not involve injuries.
In this section, the data was clustered using k-means, with the number of clusters ranging from 1 (no clustering) through 7.
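As a rough sketch of what such a run can look like, the snippet below fits k-means for k = 1 through 7 with base R's `kmeans()`. The coordinates are synthetic stand-ins (uniform random points in an approximate NYC bounding box), and the column names, seed, and `nstart` value are assumptions rather than the settings used for the maps in this post.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Synthetic stand-in for the collision coordinates (hypothetical values),
# 160 rows to match the size of the data set.
coords <- data.frame(
  longitude = runif(160, min = -74.25, max = -73.70),
  latitude  = runif(160, min = 40.50, max = 40.90)
)

# Fit k-means for k = 1 (no clustering) through 7.
fits <- lapply(1:7, function(k) kmeans(coords, centers = k, nstart = 20))

# Total within-cluster sum of squares for each k, useful for an elbow plot.
wss <- sapply(fits, function(fit) fit$tot.withinss)

# Cluster label of every point for one choice of k, here k = 4.
clusters_k4 <- fits[[4]]$cluster
```

Note that k-means on raw longitude and latitude treats degrees as Euclidean units, which is a reasonable approximation over an area the size of a city but not over large distances.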
In this map we identify a hole that includes the East River.
In this map we identify a hole (a parallelogram) in Staten Island.
In this map we identify holes in Staten Island, Queens and the Upper East Side of Manhattan.
In this map we identify holes in Staten Island, Queens, Lower Manhattan and Harlem.
In this map we identify holes in Staten Island, Queens, Lower Manhattan, Harlem and Brooklyn.
In this map we identify holes in Staten Island, Queens, Lower Manhattan, Harlem, Brooklyn and the Bronx.
In this map we identify additional holes in Staten Island, Queens, Lower Manhattan, Harlem, Brooklyn and the Bronx.
In this map we identify two holes that include the East River and Brooklyn.
In this map we identify two holes in Manhattan (afternoon) and Queens/Brooklyn (night).
In this map we identify 20 holes in the five boroughs of New York City.
In this map we identify various holes in the five boroughs of New York City.
In this map we identify various holes in the five boroughs of New York City.
I conducted topological data analysis using the Mapper algorithm from the TDAmapper package. Here are the steps that yield the visualization above:
Apply some map (filter) to the data
Cover the range of filter values with overlapping intervals
Run a clustering algorithm (hierarchical clustering in TDAmapper) on the points that fall in each interval
Represent data clusters as nodes, and connect nodes whose clusters overlap
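To make the steps above concrete, here is a simplified, hand-rolled sketch of a one-dimensional Mapper in base R. It is not the TDAmapper implementation: the synthetic circle data, the choice of filter (the first coordinate), the number of intervals, the overlap, and the fixed `cut_height` for single-linkage clustering are all assumptions made for illustration.

```r
set.seed(1)

# Synthetic stand-in data: 160 noisy points on a circle (hypothetical, not
# the collision data).
theta <- runif(160, 0, 2 * pi)
pts <- cbind(cos(theta), sin(theta)) + matrix(rnorm(320, sd = 0.03), ncol = 2)

# Step 1: apply a filter; here simply the first coordinate.
filter_values <- pts[, 1]

# Step 2: cover the filter range with overlapping intervals.
num_intervals <- 6
overlap <- 0.5  # fraction of each interval shared with its neighbour
rng <- range(filter_values)
len <- diff(rng) / (num_intervals - (num_intervals - 1) * overlap)
starts <- rng[1] + (seq_len(num_intervals) - 1) * len * (1 - overlap)

# Step 3: cluster the points that fall in each interval.
cut_height <- 0.5  # hypothetical single-linkage cut-off
nodes <- list()
for (s in starts) {
  # Small tolerance so the maximum is not lost to floating-point rounding.
  idx <- which(filter_values >= s & filter_values <= s + len + 1e-9)
  if (length(idx) == 0) next
  if (length(idx) == 1) {
    labels <- 1L
  } else {
    tree <- hclust(dist(pts[idx, , drop = FALSE]), method = "single")
    labels <- cutree(tree, h = cut_height)
  }
  for (cl in unique(labels)) {
    nodes[[length(nodes) + 1]] <- idx[labels == cl]
  }
}

# Step 4: one node per cluster; connect nodes whose clusters share points.
n <- length(nodes)
adjacency <- matrix(0L, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (i < j && length(intersect(nodes[[i]], nodes[[j]])) > 0) {
      adjacency[i, j] <- adjacency[j, i] <- 1L
    }
  }
}
```

The resulting `adjacency` matrix can be handed to igraph for plotting. In TDAmapper, the `mapper1D()` function takes a distance matrix and filter values and returns a comparable adjacency matrix, choosing the clustering cut-off per interval instead of using one fixed height.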
The persistence silhouette and landscape of the data are both simple.
## # Generated complex of size: 682800
##
## 0% 10 20 30 40 50 60 70 80 90 100%
## |----|----|----|----|----|----|----|----|----|----|
## ***************************************************
## # Persistence timer: Elapsed time [ 7.570188 ] seconds
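The console output above looks like the progress report that `ripsDiag()` from the TDA package prints while building a Rips complex. As a hedged sketch of how a persistence diagram, landscape, and silhouette can be computed with that package, the snippet below uses a synthetic circle instead of the collision coordinates; the sample size, `maxscale`, and the `tseq` grid are assumptions, not the settings behind the output shown.

```r
library(TDA)

set.seed(1)
# Synthetic stand-in: 60 points sampled from a unit circle (hypothetical data).
X <- circleUnif(60)

# Rips persistence diagram up to dimension 1 (connected components and loops).
diag_info <- ripsDiag(X, maxdimension = 1, maxscale = 2, library = "GUDHI",
                      printProgress = FALSE)
Diag <- diag_info[["diagram"]]

# Evaluate the first landscape function and the silhouette for the
# one-dimensional features on a common grid.
tseq <- seq(0, 2, length.out = 200)
land <- landscape(Diag, dimension = 1, KK = 1, tseq = tseq)
sil  <- silhouette(Diag, p = 1, dimension = 1, tseq = tseq)

plot(tseq, land, type = "l", main = "Persistence landscape")
plot(tseq, sil, type = "l", main = "Persistence silhouette")
```

A single dominant peak in both curves corresponds to one long-lived loop, which is what a "simple" landscape and silhouette would look like.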
TDAmapper and TDA were able to provide more granular information on the NYC collision data set than the ML k-means and hierarchical clustering methods. The holes identified correspond to areas of NYC where no collisions were recorded, which this analysis treats as the safest areas.
Here is a list of possible future analyses that can be performed by joining the current data set with weather and area population data.